Police Stop Data - Processing and exploration

The goal of this notebook is to create a dataframe with information relative to policing, demographics and public safety for the 90 neighborhoods of the city of Pittsburgh, Pennsylvania.

The datasets with information on Pittsburgh:

The final product is a 90-row dataframe, one row per neighborhood, with the following columns:

Finding statistics by neighborhood requires averaging or counting the points of the police dataset that fall within each neighborhood.

Script Summary

Merging the neighborhood shapefile with the general demography, criminality and income dataframes

 Exporting the neighborhood dataframe to shapefile

Now that the neighborhood base shapefile is defined, we need to add the police stop information. This means processing the data in this notebook, exporting it as a shapefile, and then processing it again in QGIS to add subject information to the neighborhood shapefile.

Processing the police dataset

We keep only the location, date, subject-related, officer-related, and stop reason and outcome information. We also remove the raw columns.
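As a sketch, the column filtering might look like the following in pandas; the column names here are hypothetical stand-ins for the real dataset's columns:

```python
import pandas as pd

# Hypothetical subset of the police stop dataset (actual column names may differ).
stops = pd.DataFrame({
    "stop_date": ["2018-01-02", "2018-01-03"],
    "latitude": [40.44, 40.45],
    "longitude": [-79.99, -80.00],
    "subject_race": ["White", "Black"],
    "officer_race": ["White", "White"],
    "stop_reason": ["traffic", "pedestrian"],
    "outcome": ["warning", "arrest"],
    "raw_row_number": [1, 2],  # example of a raw column to drop
})

# Keep only location, date, subject, officer, reason and outcome information.
keep = ["stop_date", "latitude", "longitude",
        "subject_race", "officer_race", "stop_reason", "outcome"]
stops = stops[keep]
```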

Adding the neighborhood to each police stop

To easily manipulate spatial data, we use QGIS (https://www.qgis.org/en/site/), a "Free and Open Source Geographic Information System", which greatly simplifies the following tasks.

The first step is to create a shapefile of our police dataset:

After importing the police stops and the neighborhood shapefile in QGIS, we can start processing the data:

First of all, we are only interested in a specific area of Pittsburgh, namely the 90 main Pittsburgh neighborhoods: https://en.wikipedia.org/wiki/List_of_Pittsburgh_neighborhoods. Therefore, we discard points that fall outside this area. To do so, we use QGIS and its "Clip points in Polygon" (https://docs.qgis.org/2.8/en/docs/user_manual/processing_algs/gdalogr/ogr_geoprocessing/clipvectorsbypolygon.html) processing tool.

Then, we need to attribute to each police stop the name of the neighborhood the stop occurred in. The Police Stop Dataset initially comes with information about the neighborhood, along with the latitude and longitude of the stops. However, not all points have the neighborhood information, so we need to add it. To do so, we use the "Join Attributes by Location" processing tool in QGIS (https://docs.qgis.org/2.8/en/docs/user_manual/processing_algs/qgis/vector_general_tools/joinatributesbylocation.html).

Finally, we can count the number of stops per neighborhood using the "Count Points in Polygon" processing tool (https://docs.qgis.org/2.8/en/docs/user_manual/processing_algs/qgis/vector_analysis_tools/countpointsinpolygon.html). We can also divide the number of stops by the population of each neighborhood, to compensate for the fact that high-population neighborhoods have more stops. This is done with the "field calculator" processing tool (https://docs.qgis.org/2.8/en/docs/user_manual/working_with_vector/field_calculator.html).
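The same counting and normalization can be sketched in pandas, assuming each stop already carries its neighborhood name in a hood column (the neighborhood names and populations below are invented):

```python
import pandas as pd

# Assumed inputs: stops already joined with their neighborhood name ("hood"),
# plus a population figure per neighborhood.
stops = pd.DataFrame({"hood": ["Shadyside", "Shadyside", "Bloomfield"]})
population = pd.Series({"Shadyside": 13000, "Bloomfield": 8000}, name="population")

# Count stops per neighborhood.
counts = stops["hood"].value_counts().rename("n_stops")
hoods = pd.concat([counts, population], axis=1)

# Stops per resident compensates for denser neighborhoods having more stops.
hoods["stops_per_capita"] = hoods["n_stops"] / hoods["population"]
```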

 Importing the modified shapefiles

The stop shapefile has a new column, "hood": the neighborhood name for each stop.

The neighborhood shapefile has 2 new columns: the number of stops and the number of stops normalized by population.

Adding statistics on subject race distribution

Now that we know in which neighborhood each stop is, we can compute the subject race distribution. Moreover, we can compute the outcome of the stops by race.

Creating the final neighborhood dataframe

Check out the 12 new columns:

 Exporting as shapefile and as csv

 Exploratory Data Analysis

Computing the percentage of white and black individuals in the stopped dataset AND in the resident population
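A sketch of this computation in pandas; the column names and all figures are invented for illustration:

```python
import pandas as pd

# Hypothetical per-neighborhood totals (real figures come from the census
# and the processed stop data).
hoods = pd.DataFrame({
    "stops_white": [50, 20],
    "stops_black": [50, 80],
    "pop_white":   [8000, 2000],
    "pop_black":   [2000, 6000],
})

# Share of white individuals among stopped subjects vs. among residents.
hoods["pct_stopped_white"] = hoods["stops_white"] / (hoods["stops_white"] + hoods["stops_black"])
hoods["pct_pop_white"] = hoods["pop_white"] / (hoods["pop_white"] + hoods["pop_black"])
```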

Scatter plot: stopped race proportion vs. population race proportion

Part 2: police officers

In this part we focus on the police officers: we will try to separate them into clusters and then analyse each cluster's stop behavior, i.e. whether they stop more Black or white subjects.


Load, clean and prepare data


Clustering officers

We tested multiple approaches; they are all explained in this section.

Extract officer data and create a "profile" for each officer

The data available to us on the officers is quite limited:

We decided to take the mean age of each officer across their stops, and to capture their "experience" with the number of days between their first and last intervention, as well as the number of stops they performed. We keep only officers that are Black or white, as together they represent ~99% of the data; that way race can be encoded as a binary variable. This gives us the following "profile" for each officer:
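A sketch of how such profiles could be built with a pandas groupby; officer_id, officer_age and stop_date are assumed column names and the records are invented:

```python
import pandas as pd

# Toy stop records; the real data has one row per stop with an officer id.
stops = pd.DataFrame({
    "officer_id": [1, 1, 1, 2],
    "officer_age": [30, 31, 32, 45],
    "stop_date": pd.to_datetime(["2015-01-01", "2016-06-01", "2017-01-01", "2016-03-15"]),
})

# One "profile" row per officer: mean age, stop count, and the span in days
# between first and last intervention as a proxy for experience.
profiles = stops.groupby("officer_id").agg(
    age_mean=("officer_age", "mean"),
    n_stops=("stop_date", "size"),
    days_interval=("stop_date", lambda s: (s.max() - s.min()).days),
)
```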

First clustering technique: Using subset of columns

It seems that our data separates along the race and sex of the police officer. We also note that the "White Male" cluster is split in two, depending on the number of days worked and the number of stops.

One striking observation is the size difference of the clusters: we see that our algorithm did not split the "White Male" cluster enough to have clusters of similar sizes.

Let's have a look at the different features distributions:

We see that UMAP-DBSCAN grouped our data into 5 clusters:

However, this does not solve our problem: clusters 0 and 1 are much bigger than the others. Hence, we try adding variables to induce further splits.
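The notebook embeds the officer profiles in 2-D with UMAP before running DBSCAN; as a minimal stand-in (umap-learn may not be installed everywhere), the sketch below runs DBSCAN directly on standardized synthetic features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two artificial officer groups in feature space (age_mean, days_interval).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([30, 500], [1, 30], size=(50, 2)),
    rng.normal([50, 4000], [1, 30], size=(50, 2)),
])

# In the notebook the features are first embedded in 2-D with UMAP;
# here DBSCAN runs on the standardized features directly.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(StandardScaler().fit_transform(X))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

DBSCAN labels noise points as -1, which is why they are excluded when counting clusters.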

Second clustering technique: Using all columns

Now that our data is ready to be clustered, we use UMAP to create a 2-D representation of it.

We see that our data is clustered in the same manner as with the first technique, and the split is even worse: the white male cluster is now even bigger.

While looking at the arrest distribution, we notice something interesting:

We notice that some officers have a huge number of arrests (the median is 123 arrests, while some officers have 5365 or even 6565 arrests)!

We believe that those officers have a huge impact on the clustering. Hence, we try removing them to see whether they really influence the result.
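One simple way to flag such officers is a threshold relative to the median; the 10x-median cutoff below is purely illustrative, not the rule used in the notebook:

```python
import pandas as pd

# Hypothetical arrest counts per officer; a few extreme values dominate.
arrests = pd.Series([100, 120, 123, 150, 5365, 6565])

# Flag officers whose arrest count dwarfs the typical officer's
# (10x the median is an arbitrary, illustrative cutoff).
cutoff = 10 * arrests.median()
outliers = arrests[arrests > cutoff]
regular = arrests[arrests <= cutoff]
```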

Third clustering technique: Excluding outliers

Clusters without outliers

We clearly see that the unbalanced cluster sizes are still present. Hence this technique is not necessarily better...

But are the outliers similar?

Outliers cluster

We clearly see that the majority of the outlier points cannot be clustered together, which implies that they are very different from one another.

Note, however, that the officers with the most arrests are clustered together, implying that once an officer reaches a certain number of arrests, they fall into this cluster.

Final clustering technique: manual clustering

After failing to produce satisfying clusters with the other methods, and having learned a lot about our data from handling it so much, we attempt manual clustering. First, we keep only white male officers: they represent a huge share of the officers, and this removes the two variables that always determined how the clusters were split, race and sex. We then cut the white males' data into 3 groups by age_mean and 3 groups by days_interval, splitting at the 0.33 and 0.66 quantiles to get roughly balanced clusters.
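The quantile-based manual split can be sketched as follows; the officer values are invented and the bin labels are arbitrary:

```python
import pandas as pd

# Illustrative white-male officer profiles (values invented).
officers = pd.DataFrame({
    "age_mean": [25, 30, 35, 40, 45, 50, 55, 60, 65],
    "days_interval": [100, 900, 400, 2000, 50, 1500, 3000, 700, 2500],
})

# Split each feature at its 0.33 and 0.66 quantiles -> 3 roughly equal
# groups per feature, giving up to 3 x 3 = 9 manual clusters.
for col in ["age_mean", "days_interval"]:
    officers[col + "_bin"] = pd.cut(
        officers[col],
        bins=[-float("inf"),
              officers[col].quantile(0.33),
              officers[col].quantile(0.66),
              float("inf")],
        labels=["low", "mid", "high"],
    )

officers["cluster"] = (officers["age_mean_bin"].astype(str) + "_"
                       + officers["days_interval_bin"].astype(str))
```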

Analyze / identify clusters

We plot the resulting clusters with box plots of the features


Distribution of subjects race for each cluster

Once we have the clusters, we can merge them with the original dataset of stops to compute the distribution of stops for each cluster, then compare the distributions across clusters and see if a pattern emerges. We proceed as follows: take all the stops a cluster performed and count the number of stops for each (subject) race. Then, since the distribution of stops isn't uniform (overall, whites represent a larger portion of those who are stopped, as there are many more white people in Pittsburgh), we adjust our values by dividing by the fraction each race represents (for example, if Black subjects represent 0.2 of total stops and white subjects 0.5, and we have 10 stops of each, the adjusted values will be $\frac{10}{0.2} = 50$ for Black subjects and $\frac{10}{0.5} = 20$ for white subjects). Finally, we normalize within each cluster so that for each cluster we have the percentage of total stops each race represents. When plotting, we only compare subjects that are Black or white, as they represent the vast majority of our data (other races, such as Asian or Hispanic subjects, don't have enough data for us to show anything meaningful).
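The adjustment from the worked example above can be reproduced directly in pandas:

```python
import pandas as pd

# Overall fraction of total stops each race represents (figures from the
# worked example in the text, not the real data).
overall_share = pd.Series({"Black": 0.2, "White": 0.5})

# Raw stop counts for one cluster (10 stops of each race, as in the example).
cluster_counts = pd.Series({"Black": 10, "White": 10})

# Adjust for the unequal base rates, then normalize within the cluster.
adjusted = cluster_counts / overall_share   # Black: 50, White: 20
normalized = adjusted / adjusted.sum()      # shares of the adjusted total
```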

We redo the whole process to regenerate the clusters and the heatmap for Black officers as well, so that we can compare it with the heatmap for white male officers. The only change needed in the code is where we create the variable white_officers: to select all Black officers we simply use officer_data.query('race == 1').


Neighborhood plots

We use the previously generated and compiled data about neighborhoods to plot some maps.

Clean and prepare plotting data

Generate plots


Clusters vs neighborhoods

Once again we use the data about neighborhoods, but this time we combine it with the clusters we found.